We propose a numerical approach to the partial diagonalization of
super-large-scale quantum electronic Hamiltonian matrices. The key
ingredients of our scheme are a new method for arranging the basis
vectors in the computer's RAM and an algorithm that does not store the
matrix in RAM but regenerates it on the fly during the diagonalization
procedure. The scheme was implemented in a program that solves the
Anderson impurity model in the framework of dynamical mean-field theory
(DMFT). The DMFT equations for an electronic Hamiltonian with 18
effective orbitals, which corresponds to a matrix of dimension
2.4 x 10^9, were solved on a distributed-memory computational cluster.


Nowadays one can barely imagine a physical or computational problem
that could be solved without matrix calculus. Indeed, whenever a more
or less intricate system of equations is under investigation [1-3], we
need a reliable algorithm for basic matrix operations such as
diagonalization. This procedure is essential in various fields of
computational science, from latent semantic analysis in linguistics [4]
to the investigation of correlated crystals in condensed matter
physics [5]. At present, numerical libraries such as LAPACK [6] can be
used for the full diagonalization of matrices with dimensions of a few
tens of thousands. As the matrix size grows further, the natural
solution is to replace the full diagonalization by a partial one. This
approach is applicable if the key information about the physical system
can be extracted from a small part of the eigenvalue spectrum.

This is the case in modern quantum physics. To describe a physical
system at low temperatures we need to calculate only a small number of
the lowest eigenvalues, which are used to model the ground state and
the spectrum of low-energy excitations. This is exactly what the
partial diagonalization technique is designed for.

In our study we use the exact diagonalization (ED) approach to solve
the Anderson impurity model [7]. In this model the impurity is assumed
to be embedded in an effective electron bath instead of being placed in
a crystal lattice. This allows us to investigate the behaviour of the
system in the strongly correlated regime, where the electron-electron
interaction and the kinetic energy are of the same order. Along with
its natural physical application, the same model is used in the
self-consistent cycle that solves the equations of dynamical mean-field
theory (DMFT) [7].

In the simplest case the Hamiltonian of the Anderson model can be
written in the following form:

where σ is the spin index, c_pσ (c†_pσ) and d_0σ (d†_0σ) are the
electron annihilation (creation) operators for the bath sites and the
impurity site, respectively, V_p0 is the impurity-bath hybridization,
n_0σ = d†_0σ d_0σ is the particle number operator and U is the on-site
Coulomb interaction.
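A commonly used form of the single-impurity Anderson Hamiltonian that is consistent with the symbols defined above can be sketched as follows. The bath energies \varepsilon_p and the impurity level \varepsilon_0 are assumed names that are not defined in the surrounding text, and the sign and chemical-potential conventions of Eq. (1) may differ.

```latex
H = \sum_{p\sigma} \varepsilon_p \, c^{\dagger}_{p\sigma} c_{p\sigma}
  + \sum_{p\sigma} V_{p0} \left( c^{\dagger}_{p\sigma} d_{0\sigma}
  + d^{\dagger}_{0\sigma} c_{p\sigma} \right)
  + \varepsilon_0 \sum_{\sigma} n_{0\sigma}
  + U \, n_{0\uparrow} n_{0\downarrow}
```

Every term in this sketch conserves the number of spin-up and the number of spin-down electrons separately, which is the property the block decomposition relies on.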

In the general case the number of bath sites is infinite. In the ED
method Eq. (1) is approximated by a reduced version constructed with a
finite number of bath sites. The validity of this approximation is
thoroughly discussed in Ref. [7].

The bath and impurity sites together form the total number of sites,
i.e. the number of effective electronic orbitals Ns taken into account.
In the pioneering study [7] Ns ranged from 5 to 12, but with the power
of modern computational clusters it can be increased up to 17 [8]. This
limit generally arises from restricted computational resources,
typically the amount of RAM per processor. Therefore, to make a step
forward we need to modernize the algorithms in use so as to reduce the
RAM required. The natural solution is an algorithm that does not keep
the matrix in RAM but recalculates it on the fly at each step of the
iterative diagonalization. This approach was applied to an ultra-large
sparse spin Hamiltonian matrix of dimension 2.7 x 10^^ [9], and in this
study we extend it to an electronic problem of dimension 2.4 x 10^9.

FIG. 1. Example of the Hamiltonian matrix block for N↑ = 1 and N↓ = 1
in the system of one impurity and one bath site. The basis vectors are
constructed in accordance with Fig. 2.

FIG. 2. Basis vector in the occupation-number representation. The
entries give the number of spin-up (spin-down) electrons on the
impurity (bath) states; N↑ (N↓) is the total number of spin-up
(spin-down) electrons in the corresponding configuration.

As one can see from Eq. (1), there are no operators that change the
number of electrons of either spin projection in the system. Therefore,
in matrix form the Hamiltonian (1) has a block structure, where each
block corresponds to fixed numbers of spin-up and spin-down electrons
and does not mix with any other block. The blocks can thus be
diagonalized independently. The N↑ = 1, N↓ = 1 block for the simple
system consisting of one impurity and one bath site is shown in Fig. 1.
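The size of each block follows from counting the ways to place the spin-up and spin-down electrons on Ns orbitals. The short sketch below is an illustration only (the function names are hypothetical, not taken from the program described here); it reproduces the matrix sizes quoted for the largest blocks.

```python
from math import comb


def block_dim(ns: int, n_up: int, n_down: int) -> int:
    """Dimension of the block with n_up spin-up and n_down spin-down electrons."""
    return comb(ns, n_up) * comb(ns, n_down)


def largest_block_dim(ns: int) -> int:
    """Dimension of the largest block over all (n_up, n_down) sectors."""
    return max(block_dim(ns, u, d)
               for u in range(ns + 1) for d in range(ns + 1))


if __name__ == "__main__":
    # One impurity plus one bath site, one electron of each spin (Fig. 1).
    print("Ns = 2, block (1, 1):", block_dim(2, 1, 1))
    # Largest blocks for the system sizes used in the performance tests.
    for ns in (14, 16, 17, 18):
        print("Ns =", ns, "largest block:", largest_block_dim(ns))
```

Summing `block_dim` over all sectors gives 4^Ns, the dimension of the full Hamiltonian matrix.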

The main operation of the numerical diagonalization is the
matrix-vector multiplication. In this respect the Arnoldi method [10]
is considered the most suitable approach; it requires at least the
input and output vectors to be kept in RAM. If the electronic
Hamiltonian is generated instead of being stored in memory, the
generation has to be efficient. In this paper we propose a
high-performance method for the partial diagonalization of electronic
Hamiltonian matrices. It consists of a basis optimization module and a
matrix-vector multiplication module, each of which is described below.
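The idea of an iterative solver that only ever calls a matrix-vector product can be illustrated with a toy example. In the sketch below the matrix is a small tridiagonal stand-in (not the Anderson Hamiltonian), its elements are regenerated inside `apply_h` on every call, and a shifted power iteration is used in place of the Arnoldi method purely for brevity. All names are hypothetical.

```python
import math


def apply_h(v):
    """Return y = H v for a toy tridiagonal H (2 on the diagonal, -1 beside it).

    The matrix elements are produced from the rule on every call;
    H itself is never stored.
    """
    n = len(v)
    y = [0.0] * n
    for i in range(n):
        acc = 2.0 * v[i]
        if i > 0:
            acc -= v[i - 1]
        if i < n - 1:
            acc -= v[i + 1]
        y[i] = acc
    return y


def lowest_eigenvalue(matvec, n, shift, iters=2000):
    """Power iteration on (shift*I - H).

    shift must exceed the largest eigenvalue of H, so that the dominant
    eigenvector of the shifted operator is the ground state of H.
    Only the current vector and one work vector are kept in memory.
    """
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        hv = matvec(v)
        w = [shift * a - b for a, b in zip(v, hv)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    hv = matvec(v)
    return sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient


if __name__ == "__main__":
    print(lowest_eigenvalue(apply_h, 6, 4.0))
```

For this toy matrix the exact lowest eigenvalue is 2 - 2 cos(π/(n+1)), which gives a simple correctness check. A production solver would use Arnoldi or Lanczos iterations, which converge much faster but have the same memory profile: a handful of vectors and a `matvec` callback.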

Basis optimization. The first step in diagonalizing an Anderson-type
electronic Hamiltonian is to optimize the treatment of the basis. Above
all, we need a fast algorithm to handle the basis states. Each state is
determined by a combination of occupation numbers (Fig. 2), which in
turn can be represented as a binary code of length 2Ns (Ns digits per
spin projection).

In this representation the full basis for a quantum system of 17
effective electronic orbitals has the huge size of 2*17*4^17 binary
digits, which requires about 6 Tb of RAM. If we keep this basis in
block form, the required memory decreases to 2.5 Gb, but this is still
not acceptable.

We propose a new method of basis storage. In Table II (col. 1) we show
the basis of the Hamiltonian matrix block N↑ = 2, N↓ = 1 for a simple
quantum system of three sites. When the corresponding numbers in the
binary representation are arranged in ascending order, both the spin-up
and the spin-down electronic configurations are also arranged in
ascending order. One can see that the spin-up binary representation
changes only when the spin-down one overflows, like the digits of a
positional numeral system.
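This ordering property can be checked by brute force for the three-site example. The sketch below assumes that the spin-up pattern occupies the high Ns bits and the spin-down pattern the low Ns bits of a 2Ns-bit integer (a layout chosen for illustration); enumerating all integers in ascending order is only feasible for very small Ns.

```python
def popcount(x: int) -> int:
    """Number of set bits, i.e. number of electrons in a bit pattern."""
    return bin(x).count("1")


def block_states_sorted(ns: int, n_up: int, n_down: int):
    """Basis states of block (n_up, n_down) in ascending integer order.

    Each state is returned as a pair (up_pattern, down_pattern).
    Assumed layout: high ns bits = spin up, low ns bits = spin down.
    """
    mask = (1 << ns) - 1
    states = []
    for s in range(1 << (2 * ns)):
        up, down = s >> ns, s & mask
        if popcount(up) == n_up and popcount(down) == n_down:
            states.append((up, down))
    return states


if __name__ == "__main__":
    for up, down in block_states_sorted(3, 2, 1):
        print(format(up, "03b"), format(down, "03b"))
```

The printed list reproduces the first column of Table II: the spin-down pattern cycles through its three values before the spin-up pattern advances.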


In the general case, if we construct all the basis vectors
corresponding to any block (N↑, N↓) for a system with arbitrary Ns, the
spin-up configuration, being the outer one, runs from the lowest to the
highest binary representation only once, whereas the spin-down
configuration, being the inner one, runs through its full range once
for each spin-up configuration. Thereby, to build all basis vectors we
should treat all matrix blocks (N↑, N↓) successively, where N↑ (N↓)
ranges from 0 to Ns.

We stress that for a system of Ns electronic orbitals there is no
difference between spin-up and spin-down electronic configurations;
they are completely determined by the numbers of electrons N↑ and N↓.
Therefore, to construct any basis vector of this system one needs to
store only the sets of electronic configurations characterized by a
fixed number of electrons, varying from 0 to Ns, regardless of the spin
projection. This is illustrated in Table II (cols. 2-3), where each
basis vector of the matrix block is expressed as a combination of the
corresponding binary representations. In this example we need only 8
configurations of length 3 (Ns) binary digits (Table I) to produce any
basis state of the whole Hamiltonian matrix, instead of 9 vectors of
length 6 (2Ns) binary digits for the single block N↑ = 2, N↓ = 1, not
to mention the full basis of 64 vectors. Hence, in this case we have to
store 3 x 8 = 24 binary digits instead of 6 x 64 = 384. In this way,
the RAM required to reproduce any basis state in binary representation
for a quantum system with Ns = 17 is 1.7 Mb. This is of crucial
importance when dealing with ultra-large-scale matrices.
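A minimal sketch of this storage scheme is given below. It assumes the configurations are kept as integers grouped by electron number, as in Table I, and that a basis vector is addressed by a zero-based index inside its block; the function names and the indexing convention are illustrative assumptions, not the authors' implementation.

```python
def config_table(ns: int):
    """table[n] = ascending list of ns-bit patterns with exactly n electrons."""
    table = [[] for _ in range(ns + 1)]
    for pattern in range(1 << ns):
        table[bin(pattern).count("1")].append(pattern)
    return table


def basis_state(table, n_up: int, n_down: int, index: int):
    """Return (up_pattern, down_pattern) of basis vector `index` (zero-based).

    Spin down is the fast (inner) index and spin up the slow (outer) one,
    matching the ordering of Table II.
    """
    i_up, i_down = divmod(index, len(table[n_down]))
    return table[n_up][i_up], table[n_down][i_down]


def stored_digits(ns: int) -> int:
    """Binary digits held by the shared table: ns digits for each of 2**ns patterns."""
    return ns * (1 << ns)


def full_basis_digits(ns: int) -> int:
    """Binary digits of the explicit full basis: 2*ns digits for each of 4**ns states."""
    return 2 * ns * 4 ** ns


if __name__ == "__main__":
    table = config_table(3)
    for i in range(9):
        up, down = basis_state(table, 2, 1, i)
        print(i + 1, format(up, "03b"), format(down, "03b"))
    print(stored_digits(3), "digits instead of", full_basis_digits(3))
```

The same table serves every block and both spin projections, which is why its size grows as Ns * 2^Ns rather than 2Ns * 4^Ns. The counts are in raw binary digits; how they translate into bytes depends on the storage layout, which is not specified here.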


TABLE I. All possible electronic configurations for the quantum system
with Ns = 3. nconf is the ordinal number of the configuration.

nconf    Number of electrons
         0      1      2      3
1        000    001    011    111
2        -      010    101    -
3        -      100    110    -


Matrix-vector multiplication module. One of the most essential aspects
of numerical on-the-fly diagonalization is the optimization of the
matrix-vector multiplication. The author of Ref. [8] proposed an
algorithm in which each processor builds its own vector from only the
necessary pieces of the original one (Fig. 3) and uses it to perform
the multiplication. This vector is much smaller than the original
because of the sparse structure of the Hamiltonian matrix.

TABLE II. Basis and its representation in terms of individual
electronic configurations for the quantum system with Ns = 3. nconf is
the ordinal number of the configuration.

Basis vector    Electronic configurations
                Spin up, N↑ = 2    Spin down, N↓ = 1
                nconf / conf.      nconf / conf.
011 001         1 / 011            1 / 001
011 010         1 / 011            2 / 010
011 100         1 / 011            3 / 100
101 001         2 / 101            1 / 001
101 010         2 / 101            2 / 010
101 100         2 / 101            3 / 100
110 001         3 / 110            1 / 001
110 010         3 / 110            2 / 010
110 100         3 / 110            3 / 100

FIG. 3. Construction of the vector on a node from the pieces of the
original vector.

This is suitable for minimizing the number of interprocessor
communications, but in the case of a super-large-scale matrix even one
piece of the original vector is large, which complicates its transfer
between nodes. To solve this problem we used the fact that, because of
the sparse structure of the matrix, not every element of the vector
contributes to the result, and the contributing elements are
distributed irregularly. Therefore, we can transfer not the whole
vector pieces but their shortened versions, running from the
minimum-indexed to the maximum-indexed contributing element (Fig. 4).
This means that the very first contributing element a_min >= a_1 and
the very last one a_max <= a_p can be readily determined, making it
possible to cut the "left and right tails" [a_1, a_min-1] and
[a_max+1, a_p].
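The tail cutting can be sketched in a few lines. The example below is an assumption-laden illustration: it works on a plain Python list, takes the contributing indices as given, and leaves out the actual inter-node transfer (for example via MPI). The names `trim_piece` and `restore` are hypothetical.

```python
def trim_piece(piece, contributing):
    """Cut the non-contributing left and right tails off a vector piece.

    piece        -- list of vector elements held by the sending node
    contributing -- zero-based local indices that the receiver needs
    Returns (a_min, window) with window = piece[a_min : a_max + 1].
    """
    if not contributing:
        return 0, []
    a_min, a_max = min(contributing), max(contributing)
    return a_min, piece[a_min:a_max + 1]


def restore(window, a_min, length):
    """Receiver side: put the window back at its offset.

    The tails are zero-filled; they are never read by the multiplication.
    """
    full = [0.0] * length
    full[a_min:a_min + len(window)] = window
    return full


if __name__ == "__main__":
    piece = [float(i) for i in range(10)]
    a_min, window = trim_piece(piece, [3, 5, 7])
    print(a_min, window)
    print(restore(window, a_min, len(piece)))
```

Only the offset and the window need to be sent. Elements inside the window that do not contribute are still transferred, which keeps the message contiguous at the cost of some redundant data.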

In calculations with 128 or more processors the total amount of
transmitted data was reduced to 50-55% of that for whole-piece
transfer, providing a performance boost and a decrease in the required
RAM.

Performance estimation. To demonstrate its performance, the developed
program was applied to solve the Hubbard model [7] on the square
lattice by the DMFT method. Such a model is widely used to simulate the
electronic and magnetic structure of the superconducting copper
oxides [11].

FIG. 4. Transfer of the elements from the minimum to the maximum
necessary index of the piece.

FIG. 5. Dependence of the diagonalization time on the number of engaged
processors.

FIG. 6. Density of states of the Hubbard model on the square lattice
for the metallic (a) and insulating (b) cases.

One of the most important criteria of the performance of a parallel
method is its scalability. To test it we carried out a series of
partial diagonalizations of the 11778624 x 11778624 matrix, which
corresponds to the largest block of the Hamiltonian matrix of the
electronic model with 14 effective orbitals. All the calculations were
performed on Np processors with a clock rate of 2.2 GHz. As shown in
Fig. 5, the test curve is close to the ideal one, i.e. the method
scales well. The results of these tests were compared with those of a
program that uses the compressed row storage (CRS) method [8] to keep
the Hamiltonian matrix in RAM (Tables III and IV). Moreover, for the
first time this model was solved with 18 effective electronic orbitals.
The densities of states obtained for the metallic (a) and insulating
(b) cases using the Lanczos algorithm [12] are shown in Fig. 6.


TABLE III. Performance comparison of the developed program and another
one, where the CRS method is used to store the Hamiltonian matrix in
RAM: memory required.

Ns    Matrix size      Np     RAM, Mb               Economy, %
                              On-the-fly    CRS
14    11 778 624       32     39            300     87
16    165 636 900      256    73            500     85.4
17    590 976 100      512    131           1000    86.9
18    2 363 904 400    512    502           -       -

TABLE IV. Performance comparison of the developed program and another
one, where the CRS method is used to store the Hamiltonian matrix in
RAM: diagonalization time.

Ns    Matrix size      Np     Total time, sec
                              On-the-fly    CRS
14    11 778 624       32     120           188
16    165 636 900      256    600           602
17    590 976 100      512    1480          1300
18    2 363 904 400    512    6548          -
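The "Economy" column of Table III can be recomputed from the two RAM columns, and the time columns of Table IV can be compared in the same way. The sketch below assumes that economy is defined as the relative memory saving with respect to CRS, which reproduces the tabulated percentages; the function names are illustrative.

```python
# (Ns, RAM on-the-fly [Mb], RAM CRS [Mb], time on-the-fly [s], time CRS [s]),
# copied from the rows of Tables III and IV that have CRS data.
ROWS = [
    (14, 39, 300, 120, 188),
    (16, 73, 500, 600, 602),
    (17, 131, 1000, 1480, 1300),
]


def memory_economy(on_the_fly: float, crs: float) -> float:
    """Relative memory saving with respect to CRS, in percent."""
    return 100.0 * (1.0 - on_the_fly / crs)


def time_ratio(on_the_fly: float, crs: float) -> float:
    """On-the-fly time divided by CRS time; values above 1 mean slower."""
    return on_the_fly / crs


if __name__ == "__main__":
    for ns, ram_otf, ram_crs, t_otf, t_crs in ROWS:
        print(ns,
              round(memory_economy(ram_otf, ram_crs), 1),
              round(time_ratio(t_otf, t_crs), 2))
```

The memory saving stays near 87% for all three sizes, while the time ratio moves from below 1 at Ns = 14 to above 1 at Ns = 17, so on these data the memory gain comes with a modest time cost at the largest size for which CRS results are available.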


Conclusion. A new technique for treating super-large-scale sparse
Hamiltonian matrices has been developed. An efficient arrangement of
the basis vectors and a numerical scheme that does not keep the matrix
in RAM but recalculates it during the diagonalization procedure
resulted in a considerable saving of memory, which turned out to be
close to 87%. This gives us the opportunity to simulate quantum
impurity systems with 18 electronic orbitals.